Loom test for deadlock observed in tokio's test suite #6876
Conversation
…on_queue_depth_multi_thread test
Yep, that looks like it catches the bug.
Given Carl's suggestion here, I'm wondering if the loom test will succeed or not with one worker thread.
I think the bug requires at least one worker thread. I think the solution is to make sure that, when the runtime is started, at least one of the worker threads is in the searching state.
If we have only one worker thread and it were to deterministically always run the tasks in the order in which they are spawned, I'd expect the test to always deadlock. Even if it runs the tasks non-deterministically, I believe a deadlock would happen with a lot more frequency (still every time the first task runs before the second) than it does with two or more workers. The deadlock, as far as I can see, happens when the second worker is parked while the second task (which, when executed, would unlock the deadlock) is still in some queue elsewhere, and the first worker is blocked by the first task, unable to notify the second worker to unpark again.
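To illustrate, here is a minimal sketch of the kind of pattern I have in mind (my own example using a barrier on a two-worker runtime, assuming the rt-multi-thread feature; this is not the code from this PR or from the flaky test):

```rust
use std::sync::{Arc, Barrier};

fn main() {
    let rt = tokio::runtime::Builder::new_multi_thread()
        .worker_threads(2)
        .build()
        .unwrap();

    // Barrier for the two spawned tasks plus the main thread.
    let barrier = Arc::new(Barrier::new(3));

    for _ in 0..2 {
        let barrier = barrier.clone();
        rt.spawn(async move {
            // Blocks the worker *thread*, not just the task.
            barrier.wait();
        });
    }

    // This can hang: if both spawned tasks end up queued behind one blocked
    // worker while the other worker stays parked and is never notified,
    // nobody ever reaches the barrier a third time.
    barrier.wait();
}
```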
Looking into the deadlock scenario more closely (using loom's checkpoint feature), I have a suspicion about what is happening.
Thanks for looking into this. In theory, when Worker A finds a task, it transitions out of searching. If it is the last searching worker, it notifies a sleeping worker to wake up and try searching. In theory, Worker B would wake and process the task. However, that logic is (intentionally) racy. Here, before parking, you can see a safeguard that handles this case. When a worker transitions out of searching because it finds work, the intent is that once the task is done being polled, the worker will find the rest of the work, thus mitigating the race. In this loom test, the task blocks forever, so the runtime deadlocks.

Unfortunately, there isn't much we can do in this specific case. If we want to make the runtime "bulletproof," the best strategy would be to add some logic that detects blocked tasks, reports a warning, and pokes the runtime to get it unstuck. We may want to rewrite the original flaky test to avoid blocking the runtime.

Thanks for looking into it, though. I hope you found it a good learning experience. I'm also happy to answer any further runtime-related questions.
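A conceptual sketch of the searching hand-off described above (illustrative only; the types and names here are not tokio's actual internals):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Conceptual sketch only: a counter of workers currently searching for work.
struct Idle {
    num_searching: AtomicUsize,
}

impl Idle {
    // Called when a searching worker finds a task. Returns true if it was the
    // last searching worker, in which case the caller should wake a parked
    // worker so that any remaining work is still picked up.
    fn transition_out_of_searching(&self) -> bool {
        self.num_searching.fetch_sub(1, Ordering::SeqCst) == 1
    }
}
```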
Yes, I suppose this code is just for testing purposes, so I'm sure that it would be a different scenario with an ideal async task that does not block for a long time.
I agree. @jofas Are you still interested in this issue? (If not, I can do that test fix.)
I'd be interested in fixing the test, but I'm not exactly sure how we can determine the injection queue depth without blocking the two workers so that they can't consume any tasks from it. My initial idea was to keep the workers busy by continuously filling their LIFO slots with tasks so that they never start consuming the injection queue, but I don't know if that would work, and even then it sounds a bit brittle to me. My second idea was to have only a single worker, block it like we already do, and then fill up the injection queue, but I'm not sure that would be considered a good enough test for the multithreaded runtime. How would you fix the test, if you don't mind me asking?
I think it's okay to say that if the test doesn't succeed within some timeout, you just restart the test. Then retry up to, let's say, 10 times. (You probably have to get the blocking tasks to exit to restart the test.)
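Roughly something like this (a sketch of the idea only, not the fix that was eventually merged; `run_one_attempt` is a hypothetical stand-in for a single execution of the flaky test body):

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

fn run_with_retries(max_attempts: usize, timeout: Duration) {
    for attempt in 1..=max_attempts {
        let (done_tx, done_rx) = mpsc::channel();

        // Run one attempt on its own thread so we can bound how long we wait.
        thread::spawn(move || {
            run_one_attempt();
            let _ = done_tx.send(());
        });

        match done_rx.recv_timeout(timeout) {
            // The attempt finished before the timeout: the test passes.
            Ok(()) => return,
            // Timed out: the blocking tasks from this attempt would have to
            // be released before retrying, as noted above.
            Err(_) => eprintln!("attempt {attempt} timed out, retrying"),
        }
    }
    panic!("test did not complete within the allowed number of attempts");
}

fn run_one_attempt() {
    // Hypothetical placeholder for the real test body.
}
```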
I'm guessing this should be closed now that we merged #6916? |
I agree. Should this discussion be summarized somewhere more prominently, in case people actually trigger the deadlock in their codebase? |
We already have open bugs about tolerating blocking tasks. |
This PR adds a Loom test for the deadlock observed in #6847.
When I run this test locally on my machine, I get an error which I believe signifies that the test is able to successfully replicate the deadlock.
I used oneshot channels instead of the barriers used in the flaky test where the deadlock was first observed, because Loom currently does not support Barriers. I'm opening this up as a draft PR because I'm looking for early feedback on whether I'm on the right track here or if I have misunderstood the assignment.